CURRENT ISSUES IN TESTING, MEASUREMENT, AND EVALUATION

S. Donald Melville, Director, ERIC Clearinghouse on Tests, Measurement, and
Evaluation, Educational Testing Service, Princeton, NJ; Jacob G. Beard,
Florida State University; C. Philip Kearney, University of Michigan; Rodney
Roth, University of Alabama; Jason Millman, Cornell University



Four educators describe the issues which they see to be most current in
the fields of testing, measurement, and evaluation. The mastery of basic
skills, defined by minimum levels of competence, is discussed by Jacob G.
Beard in "Minimum Competency Testing." Issues such as accountability,
social policy, instructional implications, and psychometric issues are
brought to bear on the subject. C. Philip Kearney, in "Assessment of Higher
Order Skills," examines a set of problems more complex than those involved
with assessing basic skills. A clear definition of what constitutes higher
order skills, a sound curriculum design, and available instruments for
assessing higher order skills are among the goals which must be achieved to
adequately teach and test higher order skills. In "Testing Teachers for
Initial Certification," Rodney Roth points out some of the concerns related
to testing teachers before they begin to practice their profession. Two
major trends, using the National Teacher Examinations from Educational
Testing Service and using state programs to develop teacher certification
tests, are presented. A state-of-the-art survey by Jason Millman,
"Educational Testing and the Computer," describes computer-assisted
educational testing as it is used for writing test items, constructing tests,
administering tests, scoring and analyzing results, and record keeping.



Minimum Competency Testing

During the last decade many school systems began defining minimum
levels of competence for their students and constructing tests to measure
whether the students had achieved these minimums. These minimum competencies
usually included the basic skills of reading, writing, and arithmetic and
their application. The term "minimum competency testing" acquired special
meaning from this activity. Considerable controversy arose when, in 1976,
Florida passed a law requiring high school students to pass a minimum
competency test in order to graduate. A class-action lawsuit was brought
against Florida's school system in an effort to block the use of the test as
a graduation requirement. The courts upheld the rights of school systems to
establish minimum standards of competency for graduation, and many other
states now have similar laws. The controversy has continued and is focused
on the following issues:

Accountability 

During the 1970s there was considerable criticism of the schools and
accusations of lowered achievement. To many, minimum competency testing was
seen as a means of holding the schools accountable for graduation of
literate students who could perform the basic skills of reading, writing,
and arithmetic. All students would be tested for minimum competencies and
failures would be remedied before graduation. Students who were unable to
remedy their weaknesses and pass the test before graduation would be given
certificates, but not high school diplomas.

Many educators have expressed concern about the effects of minimum
competency programs on the overall school curriculum and the level of
achievement resulting from the programs. There is speculation that the
minimum will become the maximum competencies at the expense of higher
learning levels. Such an effect has not been demonstrated; however, some
political and educational leaders have responded to the concerns by adding
testing programs measuring higher levels of achievement.

Statewide minimum competency testing is inconsistent with the concept
of local control. Some freedom of districts to determine what is taught in
the schools must be relinquished to the state when state testing programs
are established. However, the curriculum for most schools is already rather
fully determined by state and national policies. The idea of each school
district's separately determining a unique curriculum is not consistent with
current practice.

Social Issues



Minimum competency testing is seen by some as social policy. Cohen
and Haney (1980) argued that it was another in a long line of educational
minimums which began when elementary education was made compulsory and was
followed periodically by increasing requirements for formal education.
Previous minimums have been phrased in terms of age or years of schooling.
Cohen and Haney point out that while the establishment of official minimums
has the appearance of equalizing achievement, history shows that it merely
initiates a new competition for superiority.

Minimum competency testing has also been characterized by its opponents
as a racist means of denying educational credentials such as high school
diplomas to minority, and particularly black, students. This argument is
based on the historically greater failure rate of black than of white
students on these and other academic achievement tests. Proponents of
minimum competency testing argue that it is a means of improving the
achievement of marginal students by identifying achievement deficiencies and
ensuring that all students receive a basic education.
Instructional Implications

Minimum competency testing programs must be coordinated with the
instructional program. The tests must have both curricular and
instructional validity. That is, they must measure instructional objectives
which are included in the established curriculum and which are actually
taught to all of the students.


Remedial instruction should be made available to students who fail the 
test before retaking it. This usually requires additional funding to ensure 
that adequate remediation is given. 

A basic premise of educational systems which adopt minimum competency
testing is that credit should be given for accomplishing instructional
objectives rather than for spending time in programs. This idea leads
naturally to the implementation of various instructional design concepts
such as diagnosis and prescriptive learning, individualized instruction,
and optimally designed instructional materials. These concepts have been
introduced before, but have had limited success in achieving widespread
or long-term implementation. However, effective minimum competency
testing virtually necessitates the use of such systems.



Psychometric Issues

When minimum competency tests are used to make decisions having serious
consequences for students, the psychometric properties of the test scores
become especially important. Individuals denied high school diplomas on the
basis of minimum competency tests have sued the educational system. They
have charged that the use of inadequate tests constituted violation of
the due process and equal protection clauses of the Fourteenth Amendment to
the Constitution. Therefore, users of such test results should make sure
that the testing program conforms to the standards of quality set forth by
the testing profession. This includes adherence to the Standards for
Educational and Psychological Tests published jointly by the American
Psychological Association, American Educational Research Association, and
National Council on Measurement in Education (1985). The following criteria
are especially important for minimum competency tests.

o The tests must have content, curricular, and instructional validity; 
that is, they must test material which has been taught to all the 
students. 

o Students must be given adequate warning of new standards for

o The test scores which assign students to the categories of pass or
fail must be reliable for that purpose.

o The passing score representing the achievement of minimum competency
must be arrived at rationally and the level of skill it represents
must not fluctuate from one test administration to another.

o The test must not contain items which are biased for or against any
racial, ethnic, religious, sex, or other group through characteristics
other than the measurement of stated instructional objectives.

o Absolute security of the tests must be maintained.

o Test administrations must be standard at all testing sites.

Trends 

Minimum competency testing continues to be used as a requirement for high
school graduation and has been introduced at other levels of education. For 
example, several states have installed tests of minimum competency for college 
sophomores and for teacher certification. 

Several states have responded to concerns about lowered achievement
expectations by initiating testing programs which measure levels of achievement
beyond the basic skills within, or in addition to, their minimum competency
testing programs.



Assessment of Higher Order Skills 

The teaching and testing of higher order skills is fast taking on the 
characteristics of a nationwide educational reform movement. Several states 
are developing and implementing programs aimed at assessing higher order
skills, local school districts are moving rapidly to adopt curricular
programs that emphasize the teaching of higher order skills, educational 
textbook publishers and testing companies are becoming increasingly active 
in this area, and conferences, symposia, and workshops on this topic are 
springing up all across the land. 



The growing concern over higher order skills stems principally from a
recognition that the nation's pupils, while demonstrating modest improvement
in the basic skills, are falling far short of achieving mastery of thinking
skills, one of the major instructional goals of schooling. Several examples
support this contention: a decline in SAT and ACT scores over the past
several years, results from the National Assessment of Educational Progress
demonstrating a lack of analytical skills among the nation's pupils, and
results from state testing programs suggesting shortcomings (Harnischfeger &
Wiley, 1975; NAEP, 1981; Baron, 1985).

The higher order skills are increasingly becoming a principal focus of
state level assessment efforts, a phenomenon which bodes well for those
advocating a strong curricular emphasis on the higher order skills, for tests
drive the curriculum, particularly state tests. What the state tests
determines, in large part, what the schools teach and the relative degree of
emphasis placed on the subjects and areas tested in relation to other subjects
and areas of the curriculum (Rudman, 1985).






However, the assessment of higher order skills, whether at local, state,
or national levels, poses problems that are more complex and substantially
different from those posed by the assessment of basic skills and other
subjects traditionally found in the school curriculum. The first of these
problems centers on the lack of clear definition of what constitutes higher
order skills. What precisely is it we are talking about when we use the term
"higher order skills"? A second problem is whether we are better advised to
teach, and test, higher order skills as a separate subject in the curriculum,
divorced from particular content areas such as reading, mathematics, and
science; or whether we are better advised to teach and test higher order
skills as an integral part of one or more subject areas. A third problem
focuses on the availability, or unavailability, of instruments to assess
student attainment in the higher order skills. Is there a need for
considerable test development work or are valid and reliable measures already
in existence? And there are other problems: for example, questions of a
"one-tiered" versus a "two-tiered" approach (mastery of basic skills, then
mastery of higher skills). Still other problems: the costs and benefits of
using writing samples in measuring these skills, and questions of every-pupil
testing versus a sampling of pupils.

The problem of lack of clear definition is particularly acute. "Higher
order skills" is one term used to describe thinking skills. Other terms
abound: critical thinking, higher order thinking skills, higher level skills,
reasoning, intelligence, creative thinking, lateral thinking, informal logic,
to name a few. The problem is not only to decide among these names but,
perhaps more importantly, to choose what definition or conception of thinking
will guide teaching and testing activities. At the present time, there seems
to be little if any consensus on names or definitions. For the parent, the
answer is easy: "What I want is for you to teach my child to think." For
the profession, the answer is much more complex. It includes such notions
as a habit of reflective thinking; a disposition or willingness to think
critically, assertively, and habitually; more difficult subject matter
content; critical reasoning skills; skills that go beyond recall or learning
of facts; and a literal laundry list of other cognitive activities (Beyer,
1983; Kean, 1985). One acknowledged leader in the field chooses the
term "critical thinking" and defines the concept as "reasonable reflective
thinking that is focused on deciding what to believe or do" (Ennis, 1985).
Another defines "thinking" as "the operating skill with which intelligence
acts upon experience" (de Bono, 1983). Still another offers a definition of
"higher order thinking skills" as:

     those skills that go beyond simple recall or learning
     of facts... problem identification and problem solving,
     evaluation of information and of arguments, deduction,
     inference, taking alternate points of view, creating
     reasonable arguments in support of a position, and making
     decisions. (Fremer & Daniel, 1985)

Thus, when it comes to defining precisely what thinking skills mean, it seems
there is no consensus but great diversity in both terms and concepts. For
those who would include higher order skills in a state assessment program,
the first task is one of settling on a meaningful and useful definition.

The second problem, whether the higher skills should be taught and
tested as a separate subject area or embedded or infused into existing
subject matter and tested in like fashion, also lacks resolution, even
though most people favor the latter. Still, the former approach, teaching
and testing thinking skills as a separate topic area, has strong support
among several leaders in the field. Sternberg, for example, argues that
the better strategy is one that assumes intervention at the level of mental
processes, and that pupils can be taught when and how to use particular
mental processes, and how to combine those processes into strategies that
lead to problem solutions (Sternberg, 1984). He argues for three programs
to teach the components of intelligence, intelligence being his choice of
name and definition of higher order skills. The three are Feuerstein's
"Instrumental Enrichment," Lipman's "Philosophy for Children," and "The
Chicago Mastery Learning Program" (Sternberg, 1984). Another acknowledged
leader in the field, Edward de Bono, also argues for the direct teaching of
thinking as a skill; he calls for setting aside a place in the school program
so that pupils, teachers, and parents will recognize that thinking skills are
taught directly (de Bono, 1983). However, de Bono is much less sanguine about
our ability to assess thinking. He argues that our present measures are not up
to the job because they do not observe the thinker's composite performance.
A third acknowledged leader, Robert Ennis, supports the inclusion of critical
thinking as an inherent part of traditional subject matter, even though some
contend that he favors both approaches (Ennis, 1985; Baron, 1985). While
there is ample evidence that either approach can work, most research seems
to support Ennis's view, namely, that instruction in thinking skills should
be present across subject areas and throughout the grades (Beyer, 1983; ETS,
1984; Fremer & Daniel, 1985; Kean, 1985).

Still, Connecticut, in its state level assessment programs, is using
both approaches apparently with equal success. It systematically integrates
higher order thinking skills into its assessment of the subject matter
domains covered in the ongoing Connecticut Assessment of Educational Progress
while, at the same time, it explores a variety of additional formats to
measure critical thinking and reasoning skills separately and more directly
in its newly developed Mastery Testing Program (Baron, 1985). Michigan, on
the other hand, is moving to test thinking skills as part of a revised
every-pupil reading and math assessment to be administered at grades 4, 7,
and 10 and as a newly developed every-pupil writing assessment at grades 5,
8, and 11 (Michigan Department of Education, 1986). In Florida, the emphasis
also is on testing higher order skills within content areas (Fremer & Daniel,
1985). Thus, while we see both approaches pursued in the assessment of
higher order skills, current practice seems to give an edge to teaching and
testing such skills as embedded parts of traditional subject areas.
The third problem, whether instruments currently available are adequate
for assessing higher order skills, also admits of different responses. Some
argue that commercially available standardized achievement tests include items
that measure higher order skills, and that scores and sub-scores from these
instruments can provide useful and valid information on pupil attainments of
higher order skills (Fremer & Daniel, 1985; Kean, 1985). Others contend
there are no topic-specific critical thinking tests available, but only tests
which attempt to cover critical thinking as a whole, or focus on one aspect
of critical thinking (Ennis, 1985). Still others, particularly those who
develop and implement state level assessment programs, argue that, while much
developmental work remains, there are measures of higher order skills that
can be incorporated into ongoing programs, so state level efforts need not
wait on long-term developmental efforts (Baron, 1985; MDE, 1986).

There are other problems. Should there be a two-tiered approach?
Should higher order skills be assessed only after a pupil has demonstrated
mastery of the basic skills? Should writing samples be used to assess
higher order skills? If so, what form should these take and how should they
be scored? Is it important to test every pupil at every grade level? Or
can the state accomplish its purposes by sampling grades and sampling pupils?
While research can be helpful in addressing problems of these types, their
ultimate resolution may depend more on the policy values and policy culture
prevailing in any particular state.



Testing Teachers for Initial Certification

Testing teachers before they begin to practice their profession is not a
recent phenomenon. The first official endorsement of teacher testing occurred
in the colonial era (Vold, 1985). The General Assembly of Virginia in 1686
requested that every county appoint a person who would examine and license
schoolmasters. The testing of teachers for county certification was dominant
throughout the United States from 1860 until the early 20th century.

The development of normal schools to train teachers and the approval
of teacher training programs by state departments of education led to an
elimination of testing teachers for certification by the 1920s. The American
Council on Education did, however, establish the National Teacher Examination
in 1940. Initially, it was used by local school districts to help with
teacher selection. Only recently has it been used for certification.

The testing of teachers for certification has resurfaced in the last
decade; a majority of states currently test teachers for certification and
more states plan to start. The rebirth resulted from several major factors.
Two of these factors were declining test scores and an oversupply of teachers.
Another was the large scale press coverage given to a very few letters written
by teachers to parents. The letters contained errors in grammar and spelling.

The rest of this section will present two major trends and procedures in
the testing of teachers for initial certification and briefly discuss some
current problems or dilemmas facing policy makers, researchers, and persons
involved with teacher testing.
Major Trends

One trend is to use the National Teacher Examinations (NTE) from
Educational Testing Service (ETS). The use of this test can be traced,
in part, to two court decisions from the Carolinas. South Carolina started
using the NTE to assign different grades of teacher certificates shortly after
it was developed. The type of certificate affected salaries and salary
increases.

ETS issued guidelines stating that passing scores or cut-scores
should be based on validation studies. In 1975, a District Court in North
Carolina issued a decision requiring objective proof by the state of North
Carolina of the relationship between the minimum score requirements on the
NTE and the State's objective of certifying teachers who were at least
minimally competent. Based on this decision, South Carolina authorized an
NTE validity study by ETS.

The validity study conducted by ETS assessed the extent to which the
content of the NTE tests represents the content of the teacher training
programs. Teacher educators were asked to make several judgments about the
overall test specifications and teacher training programs. They were further
asked to review each question on the test and judge its appropriateness. A
question was considered "content appropriate" if at least 51% of the judges
indicated that at least 90% of the students would have had an opportunity to
learn the content.
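
The "content appropriate" rule can be sketched in a few lines of Python. This
is only an illustration, not the procedure ETS actually coded: it assumes each
judge reports an estimated percentage of students who had an opportunity to
learn the content, and the function and parameter names are hypothetical.

```python
def is_content_appropriate(judge_estimates, student_share=90.0, judge_share=0.51):
    """Apply the 51%-of-judges / 90%-of-students rule to one test question.

    judge_estimates: one number per judge, the percentage of students that
    judge believes had an opportunity to learn the content (an assumed
    rating format; the source does not describe the rating form).
    """
    if not judge_estimates:
        raise ValueError("at least one judge rating is required")
    # Count judges who say at least `student_share` percent of students
    # had an opportunity to learn the content.
    agreeing = sum(1 for pct in judge_estimates if pct >= student_share)
    # The question passes if the agreeing judges make up at least 51%.
    return agreeing / len(judge_estimates) >= judge_share
```

For example, with four judges giving estimates of 95, 90, 80, and 92, three of
the four meet the 90% mark, so the question would count as content appropriate.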

Cut-scores derived from the validation study and adopted by South
Carolina for initial teacher certification were challenged in court. In
January, 1978, the United States Supreme Court announced that it had
affirmed the April, 1977, decision of a Federal District Court upholding
South Carolina's use of the NTE for certification. This decision prompted
several other states to adopt the NTE with cut-scores based on similar
validation procedures.

The United States government issued the Uniform Guidelines on Employee
Selection Procedures just after the Supreme Court decision on the NTE use in
South Carolina. These Guidelines apply to tests used for hiring, promotion,
and licensing and certification to the extent that licensing and
certification may be covered by Federal equal employment law. These Guidelines
require validation in terms of job relatedness. This prompted Roth
(1982) to develop a new validation procedure for his NTE study for the state
of Arkansas.

This NTE study used teachers and teacher educators to judge each test
item. The judges rated the relevance of the content measured by each
question against the domain of knowledge they believed essential for a
minimally qualified entry-level person. Most NTE validity studies done
since 1982 have assessed both job relevance and the relationship to teacher
training programs.
Another current trend is for states to develop their own teacher
certification tests. In practice, this typically means that states contract
with the National Evaluation Systems (NES) for test development and subsequent
scoring and reporting services. Georgia was the first state to develop its
own tests for teacher certification. Interestingly, Georgia decided not to
use the NTE. This was based in part on a court decision concerning its use
of the NTE for awarding an advanced teacher certificate. Georgia had
selected an NTE cut-score that was not based on a validity study for the
certificate. In January, 1975, a District Court ruled that the test had no
rational relationship to the purpose of the certificate. The Court also
indicated that a state must show a valid relationship between a general
national examination and the specific duties performed by a teacher in the
state.

States that develop their own tests typically use procedures following
the Uniform Guidelines on Employee Selection Procedures. This means that
the tests are designed based on the knowledge needed to teach a specific
subject in the state. Elliott (1986) presents various procedures used by
several states to develop their own tests. The key component in these
procedures is a job analysis. It includes some determination of the critical
and frequently performed elements of the job. The job analysis typically
begins with a large number of content or topic objectives derived by content
experts to define the scope of the teaching field. Teachers rate each
objective according to its essentiality and the amount of time spent teaching
the content. The results of this process determine the specific objectives
for which test items are developed. The items are evaluated for their
congruence with the objectives. The remaining items are field-tested in
order to produce appropriate item and test statistics. These results are
used to produce the final or actual certification test.

Problems or Dilemmas

At the outset, a major dilemma faces policy makers who must choose
whether to use the NTE or develop their own test. Some of the advantages of
the NTE are that the test is available; it is administered by a large and
creditable testing firm; it has been used for over 45 years; and its use was
upheld by the Supreme Court. One disadvantage is that appropriate tests are
not available for certain certification fields. In addition, state validation
studies that use current validation guidelines might indicate that the NTE
is not appropriate or that the derived cut-scores are extremely low.

The major advantages of state-developed tests are that the tests can
be developed for each certification field and the tests cover the essential
knowledge needed to teach a field in the State. The major disadvantages are
the time and cost involved for test development. A potential problem is
that state-developed procedures have not been tested in the courts.

A second problem for policy makers concerns what to test. Some states
test content in the certification field; other states test professional
knowledge; and still others test general knowledge. The professional and
legal guidelines for employment testing seem to indicate that the further
one moves away from the specific content needed for the position, the more
difficult it is to show job relatedness. For example, potential math teachers
should have literature as part of their training program. Should they,
however, be tested on literature as well as math in order to be certified to
teach math?

A major problem for educational researchers and people who develop
state tests or validate existing tests is to determine what guidelines and
standards are appropriate. The Supreme Court decision for South Carolina
indicates that a validity study based on the teacher training program is
appropriate. The Uniform Guidelines would seem to indicate that the South
Carolina procedure was not appropriate. Rebell (1986) states the problem by
saying that regarding the law, there is an unresolved technical issue whether
Title VII and the Equal Employment Opportunity Commission (EEOC) Guidelines
apply to licensing or credentialing exams. He also raises a question
of precisely how those validation standards, which were created largely in
the context of individual employer job selection tests, should be implemented
in the conceptually distinct licensing or credentialing context. The 1985
Standards for Educational and Psychological Tests (American Psychological
Association) have also added a section on professional and occupational
licensure and certification. These standards seem to indicate procedures
similar to those found in the Uniform Guidelines. The impact of the Debra P.
case in Florida on certification testing is another unknown variable. It
reintroduces the question of curricular and/or instructional validity.

After the validation guidelines or test development procedures have been
decided, a new series of decisions has to be made. These concern professional
judgments that have to be fought out during the process. Some examples are:
Should the percentage who typically answer an item correctly be provided for
the judges who are making item probability estimates; what is an appropriate
standard to judge item relevance, or item essentiality, or content coverage;
and what roles should various standard errors have on the process.

Conclusion

Certification is intended to protect the public. Teachers, like most
professionals, should be tested for initial certification. The problems
associated with the process are complex, but not unsolvable.

Solutions are needed because society can neither afford to have
incompetent teachers teach our children, nor can it afford to deny competent
persons the chance to practice their chosen profession.



Educational Testing and the Computer 

Computers are involved in educational testing in five areas: (a) writing
the test items, (b) constructing the tests, (c) administering the tests,
(d) scoring the tests and analyzing and interpreting the results, and
(e) keeping test records. This survey describes the state of the art with
respect to computer-assisted educational testing.
Writing the Test Items

Of the five areas, the writing of items has been least influenced by
computers. Thus far, the potential of the computer to compose item content
has not been realized.

The first attempt at computer-generated item writing took place in 1968
when two educational researchers, H.G. Osburn and David Shoemaker, working
under a U.S. Office of Education grant, developed a scheme by which the
computer would construct questions about statistics. This scheme worked by
completing a fixed part of the question called an item shell with words or
numbers randomly selected from a set of possibilities called a replacement
set. For example, a true-false question might be generated by the computer
by putting together the shell, "The middle number in a distribution is
called the" and a randomly selected word from the replacement set, "mean,
median, mode." Note that in this simple example three variations of
the true-false question are possible.
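
The item shell and replacement set idea can be illustrated with a short Python
sketch. It is a modern illustration, not the 1968 program: the names are
hypothetical, and the scoring key (that "median" makes the statement true) is
an assumption added here so that each generated item can be keyed.

```python
import random

# Item shell: the fixed part of the question, taken from the example in the text.
ITEM_SHELL = "The middle number in a distribution is called the {word}."
# Replacement set: the words that may complete the shell.
REPLACEMENT_SET = ("mean", "median", "mode")
# Assumption (not in the source): the word that makes the statement true.
CORRECT_WORD = "median"


def all_variations(shell, replacement_set):
    """Return every question the shell and replacement set can produce."""
    return [shell.format(word=word) for word in replacement_set]


def generate_item(shell, replacement_set, correct_word, rng=random):
    """Pick one replacement word at random; return (question, keyed answer)."""
    word = rng.choice(replacement_set)
    return shell.format(word=word), word == correct_word
```

Calling `all_variations(ITEM_SHELL, REPLACEMENT_SET)` lists the three possible
true-false statements; `generate_item` returns one of them together with its
keyed answer.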



In replacement-set schemes, every word that appears in
a test question is first thought of by the item writer and entered into the
computer. The computer is relegated to the trivial task of picking the words
or numbers and putting them together using straightforward algorithms to
produce the test questions. Although some attempts to have the computer
"think" like a test constructor have been carried out, for the present the
computer provides scant practical help to the item writer.

Constructing the Tests

The computer is used extensively to build tests, especially by commercial
publishers and governmental agencies. This application is made possible by
collections of items called item banks. Occasionally, items are kept only
on paper while the documentation of each item (its statistical properties,
content descriptions, and so forth) is fed into the computer. The computer
then can pick a collection of items that meets the statistical and content
specifications of the test builder. It is then left to the test constructor
to assemble the test manually.

More common, however, is the situation in which the items themselves 
are entered into the computer, together with several pieces of documentation. 
When the items are stored, the computer can both select appropriate items and 
construct and print the test itself. The successful and extensive use of the 
computer to assemble tests is in contrast to its minimal use to write items. 
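
Selecting items that meet statistical and content specifications, and then 
printing the test, might look like the following minimal sketch. The record 
layout (Item), the blueprint format, the difficulty range, and the sample 
bank are all assumptions made for illustration; they do not describe any 
particular item-banking program. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    content: str        # content classification, e.g. "fractions"
    difficulty: float   # proportion of examinees answering correctly
    text: str


def select_items(bank, blueprint, min_p, max_p):
    """Pick items per content area whose difficulty lies in [min_p, max_p].

    blueprint maps a content area to the number of items wanted.
    Raises ValueError if the bank cannot satisfy the blueprint.
    """
    chosen = []
    target = (min_p + max_p) / 2
    for content, wanted in blueprint.items():
        eligible = [i for i in bank
                    if i.content == content and min_p <= i.difficulty <= max_p]
        if len(eligible) < wanted:
            raise ValueError(f"not enough items for {content!r}")
        # Prefer items closest to the middle of the difficulty range.
        eligible.sort(key=lambda i: (abs(i.difficulty - target), i.item_id))
        chosen.extend(eligible[:wanted])
    return chosen


def print_test(items):
    """Format the selected items as a numbered test."""
    return "\n".join(f"{n}. {i.text}" for n, i in enumerate(items, start=1))


# A tiny hypothetical item bank.
BANK = [
    Item("A1", "fractions", 0.55, "1/2 + 1/4 = ?"),
    Item("A2", "fractions", 0.90, "1/2 + 1/2 = ?"),
    Item("A3", "fractions", 0.45, "2/3 - 1/6 = ?"),
    Item("B1", "decimals", 0.60, "0.5 + 0.25 = ?"),
    Item("B2", "decimals", 0.20, "0.125 x 0.8 = ?"),
]

if __name__ == "__main__":
    test = select_items(BANK, {"fractions": 2, "decimals": 1}, 0.3, 0.8)
    print(print_test(test))
```

When only the documentation is stored, select_items alone would apply and 
the test would be assembled by hand from the list of item identifiers. 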



Instructors who teach the same subjects may develop an item bank which 
they share. Sometimes they obtain the item bank from a state or local agency 
or from a commercial source; at other times they construct their own items, 
perhaps beginning by using items available from others. The Northwest Regional 
Educational Laboratory, 300 SW Sixth Avenue, Portland, Oregon 97204, provides 
listings of available item banks and reviews of existing microcomputer programs 
that will construct tests from item banks. Most of the programs are too limited 
to be very useful. A few of the more recent ones, however, show promise. 

Millman and Arter (1984) provide detailed information about item banks 
and test construction. They describe a wide variety of item banks, outline 
their advantages and disadvantages, list the conditions under which item banks 
have the most potential value, and provide an extensive set of questions to 
be asked in designing item-banking systems. 

Large-scale test development programs will become increasingly 
computerized. Individual teachers can expect to assemble their tests from 
computerized item banks as quality software and microcomputers become 
available. 

Administering the Tests 

Another area of computer application is test administration. What makes 
this use so fascinating is the ability to program the computer to consider 
a student's prior answers when picking the next question; that is, to select 
items for administration based on the student's previous responses. Thus, 
the examination given to each student can be tailored or adapted to his or 
her level of ability. It is this adaptive, tailored, response-contingent 
feature that gives computer-administered testing its major advantage over 
conventional test administration. 

Adaptive testing, as it is most frequently called, has been put to use 
to help solve three knotty testing problems. The first is getting more 
measurement precision with fewer test items. It is a fact of psychometric 
life that the more test items given to a student, the more accurately the 
student's level of achievement or ability can be assessed. But teachers and 
students alike object to tests that take a long time to complete. Because 
the level of difficulty of the items a student is given under adaptive 
testing corresponds to the student's level of performance, the items carry 
maximum information about the student's ability, with the result that 
adaptively administered tests can provide the same degree of precision 
as traditionally administered tests while using about half as many items. 
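
The response-contingent selection of items can be illustrated with a very 
simple rule: give a harder item after a correct answer and an easier one 
after an incorrect answer. This up-and-down rule is swapped in here for 
simplicity and is not claimed to be the method of any operational testing 
program; the function names, the item pool, and the simulated student are 
hypothetical. 

```python
def next_item(pool, administered, last_difficulty, last_correct):
    """Choose the unused item nearest in difficulty, moving harder after a
    correct answer and easier after an incorrect one.

    pool maps item id -> difficulty on an arbitrary scale (higher = harder).
    Returns None when no unused item lies in the required direction.
    """
    candidates = []
    for item_id, difficulty in pool.items():
        if item_id in administered:
            continue
        if last_correct and difficulty > last_difficulty:
            candidates.append((difficulty - last_difficulty, item_id))
        elif not last_correct and difficulty < last_difficulty:
            candidates.append((last_difficulty - difficulty, item_id))
    return min(candidates)[1] if candidates else None


def administer(pool, start_id, answers_correctly, max_items):
    """Run an adaptive session; answers_correctly(item_id) -> bool stands in
    for the student's response to each item."""
    sequence, current = [], start_id
    while current is not None and len(sequence) < max_items:
        correct = answers_correctly(current)
        sequence.append((current, correct))
        administered = {item_id for item_id, _ in sequence}
        current = next_item(pool, administered, pool[current], correct)
    return sequence


# Hypothetical pool: seven items of increasing difficulty.
POOL = {"q1": 1, "q2": 2, "q3": 3, "q4": 4, "q5": 5, "q6": 6, "q7": 7}

if __name__ == "__main__":
    # A simulated student who can answer items of difficulty 5 or less.
    print(administer(POOL, "q4", lambda item_id: POOL[item_id] <= 5, 3))
```

Starting from a middle item, the session moves quickly toward items near the 
student's level and spends no time on items that are far too easy or far too 
hard, which is the source of the savings in test length described above. 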



The second problem attacked by adaptive testing is that of making test 
items simulate tasks that the student might face on the job or in other 
out-of-school situations. In adaptive testing, the computer can be programmed 
to permit students to progress through a program situation and to provide 
students with appropriate feedback. For example, in patient-management 
problems, a medical case is presented and the medical student indicates what 
actions should be taken. These actions might include observing the patient's 
physical condition, ordering laboratory tests, or prescribing medication or 
other treatments. The result of each action is given to the student, who 
proceeds to answer additional questions about further treatment. 

The third problem that adaptive testing is well suited to handle is 
diagnosis of student learning problems. When a student misses a test question, 
the computer can be programmed to administer carefully selected similar items 
that can pinpoint the student's misconceptions or gaps in knowledge. With 
such information, the teacher can provide appropriate remedial instruction. 

Although some large testing programs have begun to administer tests by 
computer, with positive reactions from those examined, it will be some time 
before classroom teachers routinely give their tests by computer. Tests 
embedded in instructional computer software are the exception. Questions 
asked of learners are an integral part of the teaching material, and such 
testing is often so nonintrusive that the students are not aware they are 
being tested. 

Scoring the Tests and Analyzing and Interpreting the Results 

Not many years ago, groups who administered many objective tests 
scored their own answer sheets by hand. Now desk top scoring machines 
connected to a microcomputer are available for a price that enables local 
schools and small colleges to have their own automated scoring and test 
reporting system. In a few more years, a majority of the medium- and 
large-sized school districts may score and report objective tests using 
locally owned equipment. 

Computers have been programmed to score short-answer questions and to 
grade essays. The procedure typically consists of matching the student's 
answer to key words provided by the test constructor. If the student 
supplies the key words or acceptable variations, credit is given for the 
answer. Somewhat aside, it seems that the science of short-answer and essay 
test scoring has not made any noticeable progress in the last 10 or 15 
years, nor is it likely to do so using present methods. 
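
The key-word matching procedure can be sketched as follows. The function 
name, the format of the key (one set of acceptable variations per required 
key word), and the sample answers are assumptions chosen for illustration. 

```python
import re


def score_short_answer(answer, key_words):
    """Give credit if the answer contains every required key word.

    key_words is a list of sets; each set holds one key word together with
    its acceptable variations, all in lower case.
    """
    words = set(re.findall(r"[a-z']+", answer.lower()))
    return all(bool(words & variations) for variations in key_words)


# Hypothetical key for "What is the median?"
KEY = [{"median"}, {"middle", "central", "center"}]

if __name__ == "__main__":
    print(score_short_answer("The median is the middle score.", KEY))
    print(score_short_answer("It is the average.", KEY))
```

The sketch also shows the weakness of the approach: an answer such as "The 
median is not the middle score" contains the key words and would receive 
credit, because the program matches words and does not judge meaning. 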

A traditional function of computers in testing has been to analyze 
item and test data. The prowess of computers to manipulate numbers has 
never been doubted, and computers continue to provide test developers with 
a much valued service in this regard. Using item data stored in item banks, 
some of the more sophisticated programs can predict the score distribution 
and other test results before a planned test is actually administered. 
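
One way such a prediction could work is sketched below, using only the 
proportion of examinees who answered each item correctly. The expected mean 
score is the sum of these proportions. The predicted distribution of scores 
additionally assumes that responses to different items are independent; this 
is a simplifying assumption of the sketch, and because items on a real test 
are usually positively related, actual scores would tend to be more spread 
out. The function names and the sample values are hypothetical. 

```python
def predicted_mean(p_values):
    """Expected number-correct score: the sum of the item difficulties."""
    return sum(p_values)


def predicted_distribution(p_values):
    """Probability of each number-correct score, 0 through len(p_values).

    Assumption: responses to different items are independent.
    """
    dist = [1.0]  # with no items, a score of 0 is certain
    for p in p_values:
        new = [0.0] * (len(dist) + 1)
        for score, prob in enumerate(dist):
            new[score] += prob * (1 - p)   # item answered incorrectly
            new[score + 1] += prob * p     # item answered correctly
        dist = new
    return dist


if __name__ == "__main__":
    p_values = [0.9, 0.7, 0.5, 0.3]  # hypothetical proportions correct
    print(predicted_mean(p_values))
    print(predicted_distribution(p_values))
```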



Computer interpretation of test results, particularly those of 
psychological tests, is the most controversial of all aspects of computer 
testing. Many computer companies now administer and interpret the results 
from interest, vocational, personality, and intelligence tests. The 
controversy stems in large part from the secrecy that surrounds the 
algorithms the computer uses to produce various interpretations. How the 
computer decides that a job applicant is a good risk or that a client has 
suicidal tendencies is often shrouded in proprietary secrecy, and the 
validity of these interpretations remains uncertain. 

Keeping Test Records 

Another task to which computers are well suited is keeping track of test 
performance. Computers can store results in a record or grade book, produce 
grade reports, and develop a profile of test results for an individual 
student or for the class as a whole. Microcomputer programs that perform 
these functions are readily available and relatively inexpensive. The 
computer can be programmed to keep track of other statistics in addition to 
test scores: among these, the time taken to answer each question, the 
attractiveness of each foil in a multiple-choice item, and the proportion of 
students who answered each item correctly. 
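
Two of these statistics, the proportion of students answering an item 
correctly and the attractiveness of each foil, can be computed from the 
recorded responses as in the sketch below. The function name, the data 
layout, and the sample responses are hypothetical. 

```python
from collections import Counter


def item_statistics(responses, key):
    """Summarize one multiple-choice item.

    responses: the option chosen by each student, e.g. ["A", "C", "B"].
    key: the correct option.
    Returns (proportion correct, {foil: proportion choosing it}).
    """
    if not responses:
        raise ValueError("no responses recorded")
    counts = Counter(responses)
    n = len(responses)
    proportion_correct = counts[key] / n
    foils = {opt: c / n for opt, c in counts.items() if opt != key}
    return proportion_correct, foils


if __name__ == "__main__":
    p, foils = item_statistics(["B", "B", "A", "B", "C", "A", "B", "B"], "B")
    print(p, foils)
```

A foil that no student chose does not appear in the result at all, which is 
itself a signal that the foil may not be doing its job. 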

As discussed here, computers are employed in several areas of educational 
testing. The functions of computers in these areas can be integrated, which 
may lead to more efficient and acceptable testing practices. Using items 
from a bank, the computer can assemble and administer a test and, because 
the responses of computer-administered tests are entered directly into the 
computer, it can quickly score, record, and interpret the results. As 
computers and programs for carrying out these tasks become more readily 
available, we can expect a greater proportion of testing activities to be 
aided by computer. Although the computer can make the process easier to 
implement, the educational benefit that accrues to the student will depend 
on the quality of the items that make up the tests and on how the results 
are put to use. 